feat: add Vietnamese language support by xirothedev · Pull Request #1013 · oramasearch/orama

xirothedev · 2026-02-11T17:28:35Z

Summary

Adds Vietnamese (vi) language support to Orama, including:

- Stemmer (packages/stemmers/lib/vi.js): identity function. Vietnamese is an isolating (analytic) language whose words do not inflect, so no morphological stemming is needed.
- Stopwords (packages/stopwords/lib/vi.js): common Vietnamese function words (conjunctions, prepositions, particles, pronouns, etc.).
- Splitter regex in languages.ts: covers all Vietnamese diacritics (ă, â, ê, ô, ơ, ư, đ, and all tone marks).
- Build scripts: updated both the stemmers and stopwords build scripts.
- Tests: added a tokenizer test case for Vietnamese.

Why Vietnamese?

Vietnamese is spoken by ~100 million native speakers and is one of the most widely used languages in Southeast Asia. Adding support enables Orama users to build search for Vietnamese content.

Technical Notes

Vietnamese is an isolating language: unlike most European languages, its words do not change form (no conjugation, declension, or other inflection). This means:

- The stemmer is intentionally an identity function (return word).
- Tokenization relies on the splitter regex and stopword filtering.
- Vietnamese separates syllables with spaces, so the default space-based tokenization works without a dedicated word segmenter. Multi-syllable words are therefore indexed as individual syllable tokens.
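The notes above can be sketched as a small pipeline. This is a minimal illustration, not the code from this PR: the `tokenize` helper, the three-word stopword list, and the character class in `splitter` are all assumptions made for the example (the real splitter lives in languages.ts and the real list in packages/stopwords/lib/vi.js).

```typescript
// Identity stemmer: Vietnamese words do not inflect, so the token is returned as-is.
const stemmer = (word: string): string => word;

// Illustrative splitter (an assumption, not the PR's regex): split on any run of
// characters that is not an ASCII letter/digit or a Vietnamese letter.
// \u1ea0-\u1ef9 is the Vietnamese part of the Latin Extended Additional block.
const splitter = /[^a-z0-9àáâãèéêìíòóôõùúýăđĩũơư\u1ea0-\u1ef9]+/;

// Tiny hypothetical stopword sample, normalized to NFC to match the input.
const stopwords = new Set(["và", "là", "của"].map((w) => w.normalize("NFC")));

function tokenize(input: string): string[] {
  return input
    .normalize("NFC") // use precomposed characters so the splitter class applies
    .toLowerCase()
    .split(splitter)
    .filter((token) => token.length > 0 && !stopwords.has(token))
    .map(stemmer);
}

// "lập trình viên" (programmer) is one word but three syllables,
// so it is indexed as three separate tokens.
const tokens = tokenize("Tôi là lập trình viên");
console.log(tokens);
```

In this sketch the stopword "là" is dropped and every remaining syllable keeps its diacritics, which is the behaviour the tokenizer test in this PR is described as checking.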

Test plan

- Verify pnpm test passes for the tokenizer tests
- Verify build scripts generate correct dist output for Vietnamese

Vietnamese is an isolating (analytic) language where words do not inflect, so the stemmer is an identity function.

Adds:
- Vietnamese stemmer (identity function)
- Vietnamese stopwords list
- Vietnamese splitter regex with full diacritics support
- Tokenizer test for Vietnamese

Vietnamese diacritics encode distinct vowels and tones, so folding them to ASCII changes word meaning. normalizeToken applied replaceDiacritics to every language, which partially stripped Vietnamese tokens (e.g. "tài" -> "tai", "trình" -> "trinh") while leaving Latin Extended Additional characters intact, producing inconsistent, lossy tokens. Skip replaceDiacritics for languages whose diacritics are significant, tracked by a new LANGUAGES_WITH_SIGNIFICANT_DIACRITICS set (currently Vietnamese). This fixes the new Vietnamese tokenizer test, which expects full diacritic preservation.
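A minimal sketch of the guard this commit describes follows. Only the names `normalizeToken`, `replaceDiacritics`, and `LANGUAGES_WITH_SIGNIFICANT_DIACRITICS` come from the commit message. The function signature, the `"vietnamese"` language key, and the `foldDiacritics` stand-in are assumptions; the stand-in strips all combining marks, so it does not reproduce the partial stripping described above.

```typescript
// Languages whose diacritics carry meaning and must not be folded to ASCII.
// The "vietnamese" key is an assumed identifier for this example.
const LANGUAGES_WITH_SIGNIFICANT_DIACRITICS = new Set<string>(["vietnamese"]);

// Hypothetical stand-in for replaceDiacritics: decompose to NFD, then drop
// combining marks in U+0300..U+036F. The library's own implementation may differ.
function foldDiacritics(token: string): string {
  return token.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Assumed signature: the real normalizeToken takes other arguments as well.
function normalizeToken(token: string, language: string): string {
  if (LANGUAGES_WITH_SIGNIFICANT_DIACRITICS.has(language)) {
    return token; // keep "tài" distinct from "tai"
  }
  return foldDiacritics(token);
}

console.log(normalizeToken("tài", "vietnamese"));
console.log(normalizeToken("café", "french"));
```

Keeping the exemption in a set rather than a hard-coded language check means another language with significant diacritics could be added with a single entry.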

…exports

- Drop dead underscore-joined compound stopwords (có_thể, một_cách, ngay_cả): the tokenizer splits on whitespace, so these never match real input, and their head words (có, một, ngay) are already individual stopwords.
- Commit the generated @orama/stemmers/vietnamese and @orama/stopwords/vietnamese package.json exports (produced by `pnpm build`), matching the existing sanskrit convention so the published packages expose the subpaths.
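The first point can be checked with a small sketch. The `tokenizeOnWhitespace` helper and both stopword sets are hypothetical and only illustrate why an underscore-joined entry can never match a whitespace-split token.

```typescript
// Hypothetical whitespace tokenizer, standing in for the real one.
const tokenizeOnWhitespace = (text: string): string[] =>
  text
    .toLowerCase()
    .split(/\s+/)
    .filter((token) => token.length > 0);

// An underscore-joined compound entry versus its head word.
const compoundOnly = new Set(["có_thể"]);
const headWords = new Set(["có"]);

// "Tôi có thể giúp" ("I can help") never contains the literal token "có_thể".
const inputTokens = tokenizeOnWhitespace("Tôi có thể giúp");

// The compound entry filters nothing; the head word removes "có".
const afterCompound = inputTokens.filter((t) => !compoundOnly.has(t));
const afterHead = inputTokens.filter((t) => !headWords.has(t));

console.log(afterCompound.length === inputTokens.length);
console.log(afterHead);
```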

xirothedev and others added 3 commits February 12, 2026 00:27

thatjuan merged commit 6af0f8b into oramasearch:main Jun 26, 2026
1 check passed

thatjuan merged 3 commits into oramasearch:main from xirothedev:feat/add-vietnamese-language
